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Abstract 

This paper introduces a new specialized algorithm for equilibrium Monte Carlo sampling of binary- 
valued systems, which allows for large moves in the state space. This is achieved by constructing 
self-avoiding walks (SAWs) in the state space. As a consequence, many bits are flipped in a single 
MCMC step. We name the algorithm SARDONICS, an acronym for Self-Avoiding Random Dynamics 
on Integer Complex Systems. The algorithm has several free parameters, but we show that Bayesian 
^ optimization can be used to automatically tune them. SARDONICS performs remarkably well in a 

O broad number of sampling tasks: toroidal ferromagnetic and frustrated Ising models, 3D Ising models, 

restricted Boltzmann machines and chimera graphs arising in the design of quantum computers. 

in 

^ 1 Introduction 

Ising models, also known as Boltzmann machines, are ubiquitous models in physics, machine learning 
and spatial statistics [TJ El 133 [351 [SU] . They have recently lead to a revolution in unsupervised learning 
^ ; known as deep learning, see for example [2Il EH [Ml [40] . There is also a remarkably large number of 

other statistical inference problems, where one can apply Rao-Blackwellization [231 HOI [35] to integrate 
out all continuous variables and end up with a discrete distribution. Examples include topic modeling 
and Dirichlet processes 8J, Bayesian variable selection [62], mixture models [36^ and multiple instance 
learning |33j . Thus, if we had effective ways of sampling from discrete distributions, we would solve a 
^ great many statistical problems. Moreover, since inference in Ising models can be reduced to max-SAT 

and counting-SAT problems [31 [65l [2] , efficient Monte Carlo inference algorithms for Ising models would 
be applicable to a vast domain of computationally challenging problems, including constraint satisfaction 
and molecular simulation. 

Many samplers have been introduced to make large moves in continuous state spaces. Notable ex- 
amples are the Hamiltonian and Riemann Monte Carlo algorithms [T2[1S1[IT]- However, there has been 
little comparable effort when dealing with general discrete state spaces. One the most popular algorithms 
I in this domain is the Swendsen-Wang algorithm [60;. This algorithm, as shown here, works well for sparse 

• • planar lattices, but not for densely connected graphical models. For the latter problems, acceptance rates 

. to make large moves can be very small. For example, as pointed out in [44) Ising models with Metropolis 

dynamics can require 10^^ trials to leave a metastable state at low temperatures, and such a simulation 
would take 10^° minutes. For some of these models, it is however possible to compute the rejection 
probability of the next move. This leads to more efficient algorithms that always accept the next move 
[m [IS] . The problem with this is that at the next iteration the most favorable move is often to go back 
to the previous state. That is, these samplers may often get trapped in cycles. 

To overcome this problem, this paper presents a specialized algorithm for equilibrium Monte Carlo 
sampling of binary-valued systems, which allows for large moves in the state space. This is achieved by 
constructing self-avoiding walks (SAWs) in the state space. As a consequence, many bits are flipped in a 
single MCMC step. 

We proposed a variant of this strategy for constrained binary distributions in [2^. The method 
presented here applies to unconstrained systems. It has many advancements, but more free parameters 
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than our previous version, thus making the sampler hard to tunc. For this reason, we adopt a Bayesian 
optimization strategy |43l [9l [38] to automatically tune these free parameters, thereby allowing for the 
construction of parameter policies that trade-off exploration and exploitation effectively. 

Monte Carlo algorithms for generating SAWs for polymer simulation originated with (55]. More 
recently, biased SAW processes were adopted as proposal distributions for Metropolis algorithms in [57], 
where the method was named configurational bias Monte Carlo. The crucial distinction between these 
seminal works and ours is that those authors were concerned with simulation of physical systems that 
inherently posess the self- avoidance property, namely molecules in some state space. More specifically, the 
physics of such systems dictated that no component of a molecule may occupy the same spatial location 
as another; the algorithms they devised took this constraint into account. In contrast, we are treating 
a different problem, that of sampling from a binary state-space, where a priori, no such requirement 
exists. Our construction involves the idea of imposing self-avoidance on sequences of generated states as 
a process to instantiate a rapidly-mixing Markov Chain on the binary state-space. It is therefore more 
related to the class of optimization algorithms known as Tabu Search [TF than to classical polymer SAW 
simulation methods. To our knowledge, though, a correct equilibrium Monte Carlo algorithm using the 
Tabu-like idea of self-avoidance in state-space trajectories has not yet been proposed. 

We should also point out that sequential Monte Carlo (SMC) methods, such as Hot Coupling and 
annealed importance sampling, have been proposed to sample from Boltzmann machines \24\ [54) . Since 
such samplers often use an MCMC kernel as proposal distribution, the MCMC sampler proposed here 
could enhance those techniques. The same observation applies when considering other meta-MCMC 
strategies such as parallel tempering [1^1 [13] , multicanonical Monte Carlo |U [5D] and Wang-Landau [M] 
sampling. 



2 Preliminaries 

Consider a binary- valued system defined on the state space S = {0, 1}^^, i.e. consisting of M variables 
each of which can be or 1. The probability of a state x — [xi, . . .xjsi] is given by the Boltzmann 
distribution: 

where /3 is an inverse temperature. An instance of such a system is the ubiquitous Ising model of statistical 
physics, also called a Boltzmann machine by the machine learning community. Our aim in this paper 
is the generation of states distributed according to a Boltzmann distribution specified by a particular 
energy function E{x.). 

A standard procedure is to apply one of the local Markov Chain Monte Carlo (MCMC) methodologies 
such as the Metropolis algorithm or the Gibbs (heat bath) sampler. As is well-known, these algorithms 
can suffer from issues of poor equilibration ("mixing") and trapping in local minima at low temperatures. 
More sophisticated methods such as Parallel Tempering [TB] and the flat-histogram algorithms (e.g. 
multicanonical [1] or Wang-Landau |64] sampling) can often dramatically mitigate this problem, but 
they usually still rely on local MCMC at some stage. The ideas presented in this paper relate to MCMC 
sampling using larger changes of state than those of local algorithms. They can be applied on their own or 
in conjunction with the previously-mentioned advanced methods. In this paper, we focus on the possible 
advantage of our algorithms over local methods. 

Given a particular state x S 5, we denote by 5„(x) the set of all states at Hamming distance n from 
X. For example if M = 3 and x = [1,1,1], then 5o(x) = {[1,1,1]}, 5i(x) = {[0, 1, 1], [1, 0, 1], [1, 1, 0]}, 
etc. Clearly |5„(x)| = (*^). 

We define the set of bits in two states x, y that agree with each other: let 7'(x,y) = {i\xi — i/i}. For 
instance if x = [0,1,0,1] and y = [0,0,0,1], then 7'(x,y) = {1,3,4}. Clearly, 'P(x,y) = 'P(y,x) and 
7'(x,x) = {l,2,...M}. 

Another useful definition is that of the flip operator, which simply inverts bit i in a state, i^(x, i) = 
(xi, . . . ,Xi, . . . ,xm), and its extension that acts on a sequence of indices, i.e. i^(x, ii, 12, . . . , ifc) = 
i^(F(...F(x,ii),Z2),...,Zfc). 

For illustration, we describe a simple Metropolis algorithm in this framework. Consider a state x e 5; 
a new state x' e Si{x) is generated by flipping a random bit of x, i.e. x' = -F(x, i) for i chosen uniformly 
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from {1, . . . , M}, and accepting the new state with probabihty: 

a = niin(l,e-^(^(^')-^(''») (2) 

If aU the single-flip neighbors of x, i.e. the states resuhing from x via a single bit perturbation, are of 
higher energy than i?(x), the acceptance rate will be small at low temperatures. 

3 SARDONICS 

In contrast to the traditional single-flip MCMC algorithms, the elementary unit of our algorithm is a type 
of move that allows for large changes of state and tends to propose them such that they are energetically 
favourable. We begin by describing the move operator and its incorporation into the Monte Carlo 
algorithm we call SARDONICS, an acronym for Self-Avoiding Random Dynamics on Integer Complex 
Systems. This algorithm will then be shown to satisfy the theoretical requirements that ensure correct 
asymptotic sampling. The move procedure aims to force exploration away from the current state. In that 
sense it has a similar spirit to the Tabu Search [18_ optimization heuristic, but the aim of SARDONICS 
is equilibrium sampling, a problem more general than (and in many situations at least as challenging as) 
minimization. 

Suppose that we have a state xg on S. It is possible to envision taking a special type of biased 
self-avoiding walk (SAW) of length k in the state space, in other words a sequence of states such that 
no state recurs in the sequence. In this type of SAW, states on sets {5i(xo) . . . 5fe(xo)} are visited 
consecutively. To generate this sequence of states (ui, . . . , u^), at step i of the procedure a bit is chosen 
to be flipped from those that have not yet flipped relative to Xq. Specifically, an element ai is selected 
from P(xo,Ui_i), with Uq = Xq, to yield state = _F(ui_i, ct;), or equivalently, = F{xo,ai, . . . ,ai). 
From Ui_i g 5i_i(xo), the set of states on Si that can result from flipping a bit in V{xq, Ui_i) are called 
the single-flip neighbours of u^-i on 5i(xo). The set of bits that can flip with nonzero probability are 
called the allowable moves at each step. A diagrammatic depiction of the SAW is shown in Figure [l] 

The elements {ui} are sampled in an energy-biased manner as follows: 



g-^E(F(ui_i,i)) / G 'P(xo,Uj_i) and 

E.ePOco...-!)-""^*^*"-^"" u,_i =F(xo,ai,...,(7,_i) 

otherwise 

The larger the value that simulation parameter 7 is assigned, the more likely the proposal / is to sample 
the lower-energy neighbors of u^^i. Conversely if it is zero, a neighbor on iSi(xo) is selected completely 
at random. In principle, the value of 7 will be seen to be arbitrary; indeed it can even be different at 
each step of the SAW. We wil have more to say about the choice of 7, as well as the related issue of the 
SAW lengths, in Section [42l 

The motivation behind using such an energy-biased scheme is that when proposing large changes of 
configuration, it may generate final states that are "typical" of the system's target distribution. To make 
a big state-space step, one may imagine uniformly perturbing a large number of bits, but this is likely to 
yield states of high energy, and an MCMC algorithm will be extremely unlikely to accept the move at 
low temperatures. 

At this point it can be seen why the term "self-avoiding" aptly describes the processes. If we imagine 
the state-space to be a high-dimensional lattice, with the individual states lying at the vertices and edges 
linking states that are single-flip neighbors, a self-avoiding walk on this graph is a sequence of states that 
induce a connected, acyclic subgraph of the lattice. In the move procedure we have described, a state 
can never occur twice at any stage within it and so the process is obviously self-avoiding. 

Note however that the construction imposes a stronger condition on the state sequence; once a tran- 
sition occurs from state x to state -F(x, i), not only may state x not appear again, but neither may any 
state y with yi = Xi. It seems natural to ask why not to use a less constrained SAW process, namely 
one that avoids returns to individual states and likely more familiar to those experienced in molecular 
and polymer simulations, without eliminating an entire dimension at each step. In our experience, try- 
ing to construct such a SAW as a proposal requires excessive computational memory and time to yield 
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Figure 1: A visual illustration of the move process for a SAW of length 3. The arrows represent the 
allowable moves from a state at that step; the red arrow shows the actual move taken in this example. 
With the system at state Xq, the SAW begins. Bit 4 of Xq has been sampled for flipping according to 
Equation [3] to yield state Ui — [11110]; the process is repeated until state U3 on 53(xo) is reached. The 
sequence of states taken by the SAW is cr = [4, 2, 5] . 

good state-space traversal. A local minimum "basin" of a combinatoric landscape can potentially contain 
a massive number of states, and a process that moves by explicitly avoiding particular states may be 
doomed to wander within and visit a substantial portion of the basin prior to escaping. 

Let Xi be the final state reached by the SAW, i.e. u^^ for a length k walk. By multiplying the SAW 
flipping probabilities, we can straightforwardly obtain the probability of moving from state Xq to Xi along 
the SAW (T, which we call /(xi,cr|xo): 

k 

/(xi, rrlxo) ^ [^(xo, t)] n f{<^M-i) (4) 

i=i 

The delta function simply enforces the fact that the final state xi must result from the sequence of flips 
in cr from xq. The set of {cr} such that /(xi, (t|xo) > are termed the allowable SAWs between Xq and 

Xi. 

Ideally, to implement a Metropolis-Hastings (MH) algorithm using the SAW proposal, we would like 
to evaluate the marginal probability of proposing xi from xo, which we call /(xi|xo), so that the move 
would be accepted with the usual MH ratio: 

f N A . 7I"(xi)/(xo|xi)\ 

a„j(xo, xi) = mm 1, — — — — — - (5) 
V 7r(xo)/(xi|xo)y 

Unfortunately, for all but small values of the walk lengths k, marginalization of the proposal is intractable 
due to the potentially massive number of allowable SAWs between the two states. 

To assist in illustrating our solution to this, we recall that a sufhcient condition for a Markov transition 
kernel K to have target tt as its stationary distribution is detailed balance: 

7r(xo)ii'(xi|xo) = 7r(xi)if(xo|xi) (6) 

One special case obtains if we used the marginalized proposal /(xi|xo) followed by the MH accept rule, 

ii:,„(xi|xo) ^ /(xi|xo)a„i(xo,xi) (7) 

As we cannot compute /(xi|xo), we shall use a kernel K{xi, (t\xo) defined on the joint space of SAWs 
and states, and show that with some care, detailed balance d6| can still hold marginally. It will be clear. 
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though that this does not mean that the resuhant marginal kernel K{xi\xo) is the same as that in ([t]) 
obtained using MH acceptance on the marginal proposal . 

Define the sequence reversal operator R{cr) to simply return a sequence consisting of the elements of 
cr in reverse order; for example i?([2, 3, 1, 4]) — [4,1,3,2]. One can straightforwardly observe that each 
allowable SAW cr from xq to xi can be uniquely mapped to the allowable SAW R{cr) from Xi to Xq. For 
example in Figure [l] the SAW R{cr) = [5, 2, 4] can be seen to be allowable from Xi to xg. Next, we have 
the following somewhat more involved concept, a variant of which we introduced in [26) : 

Definition 1. Consider a Markov kernel _ftr(xi, cr|xo) whose support set coincides with that of Q). We 
say that pathwise detailed balance holds if 

7r(xo)i^(xi,(T|xo) = 7r(xi)i4r(xo,i?(<T)|xi) 

for all cr, Xq, Xi. 

It turns out that pathwise detailed balance is a stronger condition than marginal detailed balance. In 
other words, 

Proposition 1. // the property in Definition^ holds for a transition kernel K of the type described there, 
then 7r(xo)iir(xi|xo) = 7r(xi)if (xo|xi) 

Proof. Suppose, for given xo,Xi, we summed both sides of the equation enforcing pathwise detailed 
balance over all allowable SAWs {cr'} from xq to Xi, i.e. 

^ 7r(xo)X(xi, ^'|xo) - ^(^i)Ki^o, i?(T')|xi) 

ct' cr' 

The left-hand summation marginalizes the kernel over allowable SAWs and hence results in 7r(xo)ii'(xi|xo). 
The observation above that each allowable SAW from Xq to Xi can be reversed to yield an allowable one 
from xi to Xq implies that the right-hand side is simply a re-ordered summation over all allowable SAWs 
from xi to Xq, and can thus be written as 7r(xi)i^(xo|xi). □ 

We are now ready to state the final form of the algorithm, which can be seen to instantiate a Markov 
chain satisfying pathwise detailed balance. After proposing (xi,(t) using the SAW process, we accept 
the move with the ratio: 

/ X A . 7r(xi)/(xo,i?(cr)|xi)\ 

a Xq, Xi, cr = mm 1, — ^ 8 

V 7r(xo)/(xi,cr|xo) / 

The computational complexity of evaluating this accept ratio is of the same order as that required to 
sample the proposed SAWs/state; the only additional operations required are those needed to evaluate 
the reverse proposal appearing in the numerator, which are completely analogous to those involved in 
calculating the forward proposal. 

Let us take a closer look at the marginal transition kernel i4r(xi|xo). We can factor the joint proposal 
into: 

/(xi,cr|xo) = /(xi|xo)/(cr|xo,xi) 

Of course, if we are assuming that /(xi|xo) is intractable to evaluate, then the conditional /(cr|xo,xi) 
must be so as well, but it is useful to consider. If we now summed both sides of the joint probability of 
moving from Xq to xi over allowable paths, we would observe: 

^ 7r(xo)-ftr(xi, cr'|xo) = 7r(xo)/(xi|xo) ^ /(cr'|xo, Xi)q;(xo, Xi, cr') 

(t' cr' 

The summation on the right-hand side is thus the conditional expectation of the accept rate given that 
we are attempting to move from xq to Xi; we call it 

a(xo, xi) = /(cr'|xo, xi)a(xo, xi, cr') (9) 

cr' 

and it defines an effective acceptance rate between xq and Xi under the sampling regime described since 
A'(xi|xo) = /(xi |xo)q;(xo, xi). It is not difficult to show that q;(xo,xi) =/= Q!m(xo,xi), i.e. the marginal 
accept rate for the joint proposal is not the same as the one that results from using the marginalized 
proposal. In fact we can make a stronger statement: 
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Proposition 2. For every pair of states (xo,xi), a(xo,Xi) < a,„(xo,xi) 
Proof. For conciseness, denote — . = 7r(xo)/(xi|xo) , ^, Be&ne the sets A = {erl ■^'^i^^'^^l^^'^;^ > C} 

■' ' Qm(xo,xi) Tr(xi)/(xo|xi) L I /(<t|xo,xi) — J 



and^^{H^S^<^}- Then 

a(xo,Xi) = y^/(cr^|xo,xi)min I 1 



^(xi)/(xo|xi)/(i?(<TO|xi,xo) 
7r(xo)/(xi|xo)/(cr'|xo,xi) 

#^ 'I ^ , ^(xi)/(xo|xi) ^ 

= E/('-l-o,x,)+ E/(^(.)|x„Xo) 

= /(<t'|xo,xi) + i 5] /(i?(^')|xi,xo) 



But by definition, for a ^ A, /(cr|xo,xi) < ^ /(i?(cr) |xi, xq). 
Tlierefore, 



a(xo,xi) < ^ ^ /(i?(*T')|xi,xo) + ^ ^ /(i?((T')|xi,xo 

^( 5] /(i?(<T')|xi,Xo)+ 5] /(i?(<T')|xi,Xo 

T'eA 

am(xo,Xi) 



1 

C 



□ 



Tliere is anotlier teclmical consideration to address. The reader may have remarked that while detailed 
balance does indeed hold for our algorithm, if the SAW length is constant at fc > 1, then the resulting 
Markov chain is no longer irreducible. In other words, not every x g 5 is reachable with nonzero 
probability regardless of the initial state. For example if fc = 2 and the initial state Xq = [0, 0, 0, 0, 0, 0], 
then the state x — [0, 0, 0, 0, 0, 1] can never be visited. Fortunately, this is a rather easy issue to overcome; 
one possible strategy is to randomly choose the SAW length prior to each step from a set that ensures 
that the whole state space can eventually be visited. A trivial example of such a set is any collection of 
integers that include unity, i.e. such that single-flip moves are allowed. Another is a set that includes 
consecutive integers, i.e. {fco, fco + 1} for any fcp < M . This latter choice could allow states separated by a 
single bit to occur in two steps; in the example above, if the set of lengths was {3, 4} then we could have 
[0, 0, 0, 0, 0, 0] — >■ [0, 0, 1, 1, 1, 1] [0, 0, 0, 0, 0, 1]. While this shows how to enforce theoretical correctness 
of the algorithm, in upcoming sections we will discuss the issue of practical choice of the lengths in the 
set. 

The experimental section discusses the practical matter of efficiently sampling the SAW for systems 
with sparse connectivity, such as the 2D and 3D Ising models. 



3.1 Iterated SARDONICS 

The SARDONICS algorithm presented in Section |3] is sufficient to work effectively on many types of 
system, for example Ising models with ferromagnetic interactions. This section, however, will detail a 
more advanced strategy for proposing states that uses the state-space SAW of Section[3]as its basic move. 
The overall idea is to extend the trial process so that the search for a state to propose can continue from 
the state resulting from a concatenated series of SAWs from Xg. The reader familiar with combinatorial 
optimization heuristics will note the philosophical resemblance to the class of algorithms termed iterated 
local search [29| . but will again bear in mind that in the present work, we are interested in equilibrium 
sampling as opposed to optimization. 

We begin by noting that restriction to a single SAW is unnecessary. We can readily consider in 
principle an arbitrary concatenation of SAWs (cri, <t2, . . . , ctn). A straightforward extension of the SAW 
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proposal is then to select some number of iterations N (which need not be the same from one move 
attempt to the next,) to generate xi from Xq by sampling from the concatenated proposal, defined to be 



g(xi,cri,. . . ,crjv|xo) = /(y\ en |xo)/(y^, cr2|y^) • • • , 

/(y ,crw|y )oxi(y ) 



and to accept the move with probability 



/ ^ ■ (1 7r(xi)5(xo,i?(crjv), . . .-R(cri) xi) , 

a(xo,Xi,cri...,cr7v)==mm 1, — ■ — r (11) 

' 7r(xo)5(xi,cri, . . . ,cr^ xo) ' 



In (10), the superscripted {y*} refer to the "intermediate" states on S generated by the flip sequences, 
and the functions / are the SAW proposals discussed in Section [3j The proposed state Xi is identical to 
the final intermediate state y^; to avoid obscuring the notation with too many delta functions we take 
it as implicit that the the proposal evaluates to zero for intermediate states that do not follow from the 
flip sequences {o-j}. 

We refer to this as the iterated SARDONICS algorithm. Its potential merit over taking a single SAW 
is that it may generate more distant states from Xq that are favorable. Unfortunately, a priori there is no 
guarantee that the final state Xi will be more worthy of acceptance than the intermediate states visited in 
the process; it is computationally wasteful to often propose long sequences of flips that end up rejected, 
especially when potentially desirable states may have been passed over. It is thus very important to 
choose the right value of N . For this reason, we will introduce Bayesian optimization techniques to adapt 
N and other parameters automatically in the following section. 



3.2 Mixture of SAWs 

A further addition we made to the basic algorithm was the generalization of the proposal to a mixture 
of SAW processes. Each segment of the iterated procedure introduced in Section |3.1| could in principle 
operate at a different level of the biasing parameter 7. A possible strategy one can envision is to occa- 
sionally take a pair of SAWs with the first at a small value of 7 and the next at a large value. The first 
step encourages exploration away from the current state, and the second a search for a new low-energy 
state. More specifically, for the two SAWs (cri,cr2), we can have a proposal of the form: 

/(xi,cri,cr2|xo) /(y\ cri|xo, 7l)/(xi, cr2|y\ 7//) 

where and 7^ are high and low inverse temperature biases respectively. Unfortunately, such a method 
on its own will likely result in a high rejection rate; the numerator of the MH ratio enforcing detailed 
balance will be: 

/(xo,i?(<T2),i?(*Ti)|xi) = /(yi,i?(^2)|xi,7L)/(xo,i?(<Ti)|y\7ff) 

The probability of taking the reverse sequences will likely be very small compared to those of the forward 
sequences. In particular, the likelihood of descending at low-temperature biasing parameter along the 
sequence R{cri), where ni was generated with the high-temperature parameter, will be low. 

Our simple approach to dealing with this is to define the proposal to be a mixture of three types of 
SAW processes. The mixture weights, Pll,Phl,Plh, with Pll + Plh + Phl = 1, define, respectively, 
the frequencies of choosing a proposal unit consisting of a pair of SAWs sampled with (7l,7l), {"/HtJl), 
and {jljJh)- The first proposal type encourages local exploration; both SAWs are biased towards low- 
energy states. The second one, as discussed, is desirable as it may help the sampler escape from local 
minima. The last one may seem somewhat strange; since it ends with sampling at 7^f, it will tend to 
generate states with high energy which will consequently be rejected. The purpose of this proposal, 
however, is to assist in the acceptance of the HL exploratory moves due to its presence in the mixture 
proposal. Thus, Plh is a parameter that must be carefully tuned. If it is too large, it will generate too 
many moves that will end up rejected due to their high energy; if too small, its potential to help the HL 
moves be accepted will be diminished. The mixture parameters are thus ideal candidates to explore the 
effectiveness of adaptive strategies to tune MCMC. 
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4 Adapting SARDORNICS with Bayesian Optimization 



SARDONICS has several free parameters: upper and lower bounds on the SAW length and ki respec- 
tively), iLi Pll, PhLt Plh as explained in the previous section, and finally the number of concate- 
nated SAWs N. We group these free parameters under the symbol 6 — kij'^HilL, Pll,Phl, Plh, N}. 
Each 9 defines a stochastic policy, where the SAW length k is chosen at random in the set [ki , and 
where the SAW processes are chosen according to the mixture probabilities Pll,Phl,Plh- Tuning 
all these parameters by hand is an onerous task. Fortunately, this challenge can be surmounted using 
adaptation. Stochastic approximation methods, at first sight, might appear to be good candidates for 
carrying out this adaptation. They have become increasingly popular in the subfield of adaptive MCMC 
pn [21 HH There are a few reasons, however, that force us to consider alternatives to stochastic 
approximation. 

In our discrete domain, there are no obvious optimal acceptance rates that could be used to construct 
the objective function for adaptation. Instead, we choose to optimize 6 so as to minimize the area under 
the auto-correlation function up to a specific lag. This objective was previously adopted in [2ll38]. One 
might argue that is is a reasonable objective given that researchers and practitioners often use it to 
diagnose the convergence of MCMC algorithms. However, the computation of gradient estimates for 
this objective is very involved and far from trivial This motivates the introduction of a gradient- 
free optimization scheme known as Bayesian optimization [431 13 • Bayesian optimization also has the 
advantage that it trades-off exploration and exploitation of the objective function. In contrast, gradient 
methods are designed to exploit locally and may, as a result, get trapped in unsatisfactory local optima. 

The proposed adaptive strategy consists of two phases: adaptation and sampling. In the adaptation 
phase Bayesian optimization is used to construct a randomized policy. In the sampling phase, a mixture 
of MCMC kernels selected according to the learned randomized policy is used to explore the target 
distribution. Experts in adaptive MCMC would have realized that there is no theoretical need for this two- 
phase procedure. Indeed, if the samplers are uniformly ergodic, which is the case in our discrete setting, 
and adaptation vanishes asymptotically, then ergodicity can still be established [Sit However, in our 
setting the complexity of the adaptation scheme increases with time. Specifically, Bayesian optimization, 
as we shall soon outline in detail, requires fitting a Gaussian process to / points, where / is the number 
of iterations of the adaptation procedure. In the worst case, this computation is 0{I^). There are 
techniques based on conjugate gradients, fast multipole methods and low rank approximations to speed 
up this computation [151 122] . However, none of these overcome the issue of increasing storage and 
computational needs. So, for pragmatic reasons, we restrict the number of adaptation steps. We will 
discuss the consequence of this choice in the experiments and come back to this issue in the concluding 
remarks. 

The two phases of our adaptive strategy are discussed in more detail subsequently. 
4.1 Adaptation Phase 

Our objective function for adaptive MCMC is the area under the auto-correlation function up to a 
specific lag. This objective is intractable, but noisy observations of its value can be obtained by running 
the Markov chain for a few steps with a specific choice of parameters Oi. Bayesian optimization can be 
used to propose a new candidate by approximating the unknown function using the entire history of 
noisy observations and a prior over this function. The prior distribution used in this paper is a Gaussian 
process. 

The noisy observations are used to obtain the predictive distribution of the Gaussian process. An 
expected utility function derived in terms of the sufficient statistics of the predictive distribution is 
optimized to select the next parameter value Oi+i- The overall procedure is shown in Algorithm [T] We 
refer readers to |9] and [37] for in-depth reviews of Bayesian optimization. 

The unknown objective function /i(-) is assumed to be distributed according to a Gaussian process 
with mean function m(-) and covariance function k{-,-): 

h{-)^GP{m{-),k{-,-)). 

We adopt a zero mean function m(-) = and an anisotropic Gaussian covariance that is essentially the 
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ALGORITHM 1: Adaptation phase of SARDONICS 



1: for i = 1,2, . . . ,/ do 

2: Run SARDONICS for L steps with parameters 0^. 

3: Use the drawn samples to obtain a noisy evaluation of the objective function: = h(Oi) + e. 
4: Augment the observation set "Di-i = {Di^i-i, (0^,2^)}. 
5: Update the CP's sufficient statistics. 

6: Find fli+i by optimizing an acquisition function: = arg maxg M(6|I'i;i). 

7: end for 



popular automatic relevance determination (ARD) kernel 



k{e,,ek) = exp y-i^{Oj - OkY diagW-)" - 0fe) 

where -0 e M"* is a vector of hyper-parameters. The Gaussian process is a surrogate model for the 
true objective, which involves intractable expectations with respect to the invariant distribution and the 
MCMC transition kernels. 

We assume that the noise in the measurements is Gaussian: Zi — h{Oi) + e, e ^ A/'(0, tr,^). It is 
possible to adopt other noise models ill:. Our Gaussian process emulator has hyper-parameters tp and 
cr^. These hyper-parameters are typically computed by maximizing the likelihood [49j . In Bayesian 
optimization, we can use Latin hypercube designs to select an initial set of parameters and then proceed 
to maximize the likelihood of the hyper-parameters iteratively |66l I55j . This is the approach followed 
in our experiments. However, a good alternative is to use either classical or Bayesian quadrature to 
integrate out the hyper-parameters [47[ I53j . 

Let Zi-i ^ A/'(0,K) be the i noisy observations of the objective function obtained from previous 
iterations. (Note that the Markov chain is run for L steps for each discrete iteration i. The extra index 
to indicate this fact has been made implicit to improve readability.) Zi:i and /li+i are jointly multivariate 
Gaussian: 



hi+i 



-AA 0, 



K 



k{e,e) 



where 



K 



fc(6»i,0i) 



fc(0„6>i) ... k{9,,e,) 

and k = [k{6, 6i) . . . k(0, Oi)]'^ ■ All the above assumptions about the form of the prior distribution and 
observation model are standard and less restrictive than they might appear at first sight. The central 
assumption is that the objective function is smooth. For objective functions with discontinuities, we need 
more sophisticated surrogate functions for the cost. We refer readers to [19] and 9J for examples. 

The predictive distribution for any value 6 follows from the Sherman-Morrison- Woodbury formula, 
where Vi-i = (61,^,2.1,1): 



aUe) = k{e,9)-k^{K 



Ik 



The next query point is chosen to maximize an acquisition function, u(6\'Di,i), that trades- 

off exploration (where crfiO) is large) and exploitation (where iJii{9) is high). We adopt the expected 
improvement over the best candidate as this acquisition function [4311561 15]. This is a standard acquisition 
function for which asymptotic rates of convergence have been proved |10| . However, we point out that 
there are a few other reasonable alternatives, such as Thompson sampling |41j and upper confidence 
bounds (UCB) on regret [59] . A comparison among these options as well as portfolio strategies to combine 
them appeared recently in [28]. There are several good ways of optimizing the acquisition function. 
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including the method of Divided RECTangles (DIRECT) of [H] and many versions of the projected 
Newton methods of [5]. We found DIRECT to provide a very efficient solution in our domain. Note that 
optimizing the acquisition function is much easier than optimizing the original objective function. This 
is because the acquisition functions can be easily evaluated and differentiated. 

4.2 Sampling Phase 

The Bayesian optimization phase results in a Gaussian process on the / noisy observations of the perfor- 
mance criterion zi-j, taken at the corresponding locations in parameter space 9i j. This Gaussian process 
is used to construct a discrete stochastic policy p{6\zi-j) over the parameter space 0. The Markov chain 
is run with parameter settings randomly drawn from this policy at each step. 

One can synthesize the policy p{6\zi.j) in several ways. The simplest is to use the mean of the GP 
to construct a distribution proportional to exp(/x(0)). This is the so-called Boltzmann policy. We can 
sample M parameter candidates 6i according to this distribution. Our final sampler then consists of a 
mixture of M transition kernels, where each kernel is parameterized by one of the 9i, i — 1, . . . , M. The 
distribution of the samples generated in the sampling phase will approach the target distribution 7r(-) as 
the number of iterations tends to cxi provided the kernels in this finite mixture are ergodic. 

In high dimensions, one reasonable approach would be to use a multi-start optimizer to find maxima 
of the unnormalized Boltzmann policy and then perform local exploration of the modes with a simple 
Metropolis algorithm. This is a slight more sophisticated version of what is often referred to as the epsilon 
greedy policy. 

The strategies discussed thus far do not take into account the uncertainty of the GP. A solution is to 
draw M functions according to the GP and then find he optimizer 0i of each of these functions. This is 
the strategy followed in [41^ for the case of contextual bandits. Although this strategy works well for low 
dimensions, it is not clear how it can be easily scaled. 

5 Experiments 

Our experiments compare SARDONICS to the popular Gibbs, block-Gibbs and Swendsen-Wang samplers. 
Several types of binary- valued systems, all belonging to the general class of undirected graphical model 
called the Ising model, are used. The energy of a binary state s, where Si G { — 1, 1} is given by: 

E{s) = - ^ JijSiSj - ^ hiS, 

(One can trivially map Xi G {0, 1} to Si G {—1, 1} and vice-versa.) The interaction weights Jij between 
variables i and j are zero if they are topologically disconnected; positive (also called "ferromagnetic" ) 
if they tend to have the same value; and negative ( "anti- ferromagnetic" ) if they tend to have opposite 
values. The presence of interactions of mixed sign can significantly complicate Monte Carlo simulation 
due to the proliferation of local minimas in the energy landscape. Interaction weights of different sign 
produce unsatisfiable constraints and cause the system to become "frustrated" . 

Parameters of SARDONICS (9) 



h ku 7L IH Pll Phl Plh N 



Range {!,..., 70} {2, . . . , 120} [0.89,1.05] [0.9,1.15] [0,1] [0,1] [0,1] {1,...,5} 

Table 1: The ranges from which Bayesian Optimization chooses parameters for SARDONICS. The selec- 
tion mechanism ensures that ku > ki and jh 5^ 7l- 

The first set of experiments considers the behavior of Gibbs, SARDONICS and Swendsen-Wang on a 
ferromagnetic Ising model on a planar, regular grid of size 60 x 60. The model has connections between 
the nodes on one boundary to the nodes on the other boundary for each dimension. As a result of these 
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Figure 2: The 2D Ferromagnetic model with periodic (toroidal) boundaries [top left], the auto-correlations 
of the samplers for T — 2.27 (critical temperature) [top right], traces of every 5 out of the 10^ samples 
of the energy [bottom left] and rewards obtained by the Bayesian optimization algorithm during the 
adaptation phase [bottom right]. 



periodic boundaries, the model is a square toroidal grid. Hence, each node has exactly four neighbors. 
In the first experiment, the interaction weights, Jij, are all 1 and the biases, hi, are all 0. We test the 
three algorithms on this model at three different temperatures: 1, 2.27 and 5. The value (3 = 1/2.27 
corresponds to the so-called critical temperature, where many interesting phenomena arise |46| but where 
simulation also becomes quite difficult. 

The experimental protocol for this and subsequent models was the same: For 10 independent trials, 
run the competing samplers for a certain number of iterations, storing the sequence of energies visited. 
Using each trial's energy sequence, compute the auto- correlation function (ACF). Comparison of the 
algorithms consisted of analyzing the energy ACF averaged over the trials. Without going into detail, a 
more rapidly decreasing ACF is indicative of a faster-mixing Markov chain; see for example [HQ. For all 
the models, each sampler is run for 10^ steps. For SARDONICS, we use the first 2 x 10'' iterations to 
adapt its hyper-parameters. For fairness of comparison, we discard the first 2 x 10"* samples from each 
sampler and compute the ACF on the remaining 8 x 10^ samples. For all our experiments, we use the 
ranges of parameters summarized in Table I to adapt SARDONICS. 

As shown in Figure [2j at the critical temperature, Swendsen-Wang does considerably better than its 
competitors. It is precisely for these types of lattice that Swendsen-Wang was designed to mix efficiently, 
through effective clustering. This part of the result is therefore not surprising. However, we must consider 
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Figure 3: Auto-corrclations of the samplers, on the Ferromagnetic 2D grid Ising model, for T ~ I [top 
left] and T = 5 [top right]. Traces of every 5 out of the 10^ samples of the energy at T = 1 [bottom left] 
and T = 5 [bottom right] . 



the performance of SARDONICS carefully. Although it does much better than Gibbs, as expected, it 
under-performs in comparison to Swendsen-Wang. This seems to be a consequence of the fact that the 
probability distribution at this temperature has many peaks. SARDONICS, despite its large moves, can 
get trapped in these peaks for many iterations. At temperature 5, when the distribution is flattened, 
the performance of SARDONICS is comparable to that of Swendsen-Wang as depicted in Figure [3] The 
same figure also shows the results for T = 1, where the target distribution is even more peaked. The good 
performance of SARDONICS for T = 1 might seem counterintuitive considering the previous results for 
T — 2.27. An explanation is provided subsequently. 

At temperatures 1 and 2.27, adaptation is very hard. Before the sampler converges, it is beneficial 
for it to take large steps to achieve lower auto-correlation. The adaptation mechanism learns this and 
hence automatically chooses large SAW lengths. But after the sampler converges, large steps can take 
the sampler out of the high-probability region thus leading to low acceptance. If the sampler hits the 
peak during adaptation then it learns to choose small SAW lengths. This may also be problematic if we 
restart the sampler from a different state. This points out one of the dangers of having finite adaptation 
schemes. A simple solution, in this particular case, is to change the bounds on the SAW lengths manually. 
This enables SARDONICS to achieve performance comparable to that of Swendsen Wang for these nearly 
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Figure 4: Frustrated 2D grid Ising model with periodic boundaries [top left], auto-correlations of the 
three samplers [top right], traces of the last 20000 of 100000 samples of the energy [bottom left] and 
rewards obtained by the Bayesian optimization algorithm during the adaptation phase [bottom right]. 



deterministic models, as shown in Figure [3] for T = 1. However, ideally, infinite adaptation mechanisms 
might provide a more principled and general solution. 

In addition to studying the effect of temperature changes on the performance of the algorithms, we 
also investigate their sensitivity with respect to the addition of unsatisfiable constraints. To accomplish 
this, we set the interaction weights Jij and the biases hi uniformly at random on the set {—1, 1}. We 
set the temperature to T = 1.0. We refer to this model as the frustrated 2D grid Ising model. As shown 
by the auto-correlations and energy traces plotted in Figure |4] SARDONICS does considerably better 
than its rivals. It is interesting to note that Swendsen-Wang does much worse on this model as the 
unsatisfiable constraints hinder effective clustering. The figure also shows the reward obtained by the 
Bayesian optimization scheme as a function of the number of adaptations. The adaptation algorithm 
traded-off exploration and exploitation effectively in this case. 

The third batch of experiments compares the algorithms on an Ising model where the variables are 
topologically structured as a 9 x 9 x 9 three-dimensional cube, Jij are uniformly sampled from the set 
{— 1, 1}, and the hi are zero. /3 was set to 1.0, corresponding to a lower temperature than the value of 
0.9, at which it is known [35] that, roughly speaking, regions of the state space become very difficult to 
visit from one another via traditional Monte Carlo simulation. Figure [5] shows that for this more densely 
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Figure 5: Frustrated 3D cube Ising model with periodic boundaries (for visualization simplicity, the 
boundary edges are not shown) [top left], auto-correlations of the three samplers [top right], traces of 
the last 20000 of 100000 samples of the energy [bottom left] and rewards obtained by the Bayesian 
optimization algorithm during the adaptation phase [bottom right] . 



connected model, the performance of Swendsen-Wang deteriorates substantially. However, SARDONICS 
still mixes reasonably well. 

While the three-dimensional-cube spin-glass is a much harder problem than the 2D ferromagnet, it 
represents a worst case scenario. One would hope that problems arising in practice will have structure 
in the potentials that would ease the problem of inference. For this reason, the third experimental set 
consisted of runs on a restricted Boltzmann machine |58| with parameters trained from natural image 
patches via stochastic maximum likelihood [HIlllO]- RBMs are bipartite undirected probabilistic graphical 
models. The variables on one side are often referred to as "visible units", while the others are called 
"hidden units" . Each visible unit is connected to all hidden units. However there are no connections 
among the hidden units and among the visible units. Therefore, given the visible units, the hidden units 
are conditionally independent and vice-versa. Our model consisted of 784 visible and 500 hidden units. 
The model is illustrated in Figure |6] 

The pre- learned interaction parameters capture local regularities of natural images |31) . Some of these 
parameters are depicted as images in Figure |8] The parameter /? was set to one. The total number of 
variables and edges in the graph were thus 1284 and 392000 respectively. 
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Figure 6: Frustrated 3D cube Ising model with periodic boundaries [top left], auto-correlations of the 
three samplers [top right], traces of the last 20000 of 100000 samples of the energy [bottom left] and 
rewards obtained by the Bayesian optimization algorithm during the adaptation phase [bottom right]. 



Figures [6] and [T] show the results. Again SARDONICS mixes significantly better than Swendsen-Wang 
and the naive Gibbs sampler. Swendsen-Wang performs poorly on this model. As shown in Figure [7j it 
mixes slowly and fails to converge after 10^ iterations. 

For this bipartite model, it is possible to carry out block-Gibbs sampling (the standard method 
of choice). Encouragingly, SARDONICS compares well against this popular block strategy. This is 
important, because computational neuro-scientists would like to add lateral connections among the hidden 
units, in which case block-Gibbs would no longer apply unless the connections form a tree structure. 
SARDONICS thus promises to empower computational neuro-scientists to address more sophisticated 
models of perception. The catch is that at present SARDONICS takes considerably more time than 
block-Gibbs sampling for these models. We discuss this issue in greater length in the following section. 

Finally, we consider a frustrated 128-bit chimera lattice that arises in the construction of quantum 
computers [7]. As depicted in Figure [9j SARDONICS once again outperforms its competitors. 
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Figure 7: Trace of the last 20000 of 100000 samples of the energy for the Swendsen-Wang sampler. As 
we can see, the sampler performs poorly on the RBM model. 
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Figure 8: RBM parameters. Each image corresponds to the parameters connecting a specific hidden unit 
to the entire set of visible units. 

5.1 Computational considerations 

The bulk of the computational time of the SARDONICS algorithm is spent in generating states with the 
SAW proposal. At each step of the process, a component from a discrete probability vector, corresponding 
to the variable to flip, must be sampled. Naively, the time needed to do so scales linearly with the length 
/ of the vector. In graphical models of sparse connectivity, however, it is possible achieve a dramatic 
computational speedup by storing the vector in a binary heap. Sampling from a heap is of O(logZ), but 
for sparsely connected models, updating the heap in response to a flip, which entails replacing the energy 
changes that would result if the flipped variable's neighboring variables were themselves to flip, is also 
of logarithmic complexity. In contrast, for a densely connected model, the heap update would be of 
0(Aflog/) where M is the maximum degree of the graph, while recomputing the probability vector in 
the naive method is 0{M). The simple method is thus cheaper for dense models. In our experiments, we 
implemented SARDONICS with the binary heap despite the fact that the RBM is a densely connected 
model. 

For densely connected models, one could easily parallelize the computation of generating states with 
the SAW proposal. Suppose we have n parallel processes. Each process P holds one section Up of the 



16 









5000 10000 


15000 20 



g 0.6 



50 100 150 

Number of Adaptations 



Figure 9: Chimera lattice [top left], auto-correlations of the three samplers [top right], traces of the last 
20000 of 100000 samples of the energy [bottom left] and rewards obtained by the Bayesian optimization 
algorithm during the adaptation phase [bottom right] . 



unnormalized probability vector U. To sample from the discrete vector in parallel, each parallel process 
can sample one variable Vp to flip according to Up. Each process also calculates the sum of its section of 
the unnormalized probability vector Sp. Then the variable to flip is sampled from the set {vp : 1 < p < n} 
proportional to the discrete probability vector [si, S2, s„]. To update U in response to a flip, we could 
use the naive method mentioned above. If we divide the U evenly among processes and assume equal 
processor speed, sampling from and updating the unnormalized probability vector would take 0{ +n) 
operations. When n is small compared to M + /, we could achieve near linear speed up. 

We compare, in table II, the amount of time it takes to draw 10^ samples using different samplers. 
In principle, it is unwise to compare the computational time of samplers when this comparison depends 
on specific implementation details and architecture. The table simply provides rough estimates. The 
astute reader might be thinking that we could run Gibbs for the same time as SARDONICS and expect 
similar performance. This is however not what happens. We tested this hypothesis and found Gibbs to 
still underperform. Moreover, in some cases, Gibbs can get trapped as a result of being a small move 
sampler. 

The computational complexity of SARDONICS is affected by the degree of the graph, whereas the 
complexity of Swendsen-Wang increases with the number of edges in the graph. So for large sparsely- 
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Samplers 



Models 



SARDONICS Swendsen-Wang 



Gibbs 



Block Gibbs 



Ferromagnetic Ising Model 
Frustrated 2D Ising Model 
Frustrated 3D Ising Model 



20 minutes 
20 minutes 
10 minutes 



a few minutes 
a few hours 
tens of seconds 



50 minutes 
50 minutes 



tens of seconds N/A 
tens of seconds N/A 
tens of seconds N/A 



RBM 
Chimera 



a few minutes 



10 hours 



20 minutes a few minutes 
tens of seconds N/A 



Table 2: Rough computation time for each sampler on different models. All samplers are run on the 

same computer with 8 CPUs. The Swendsen-Wang sampler is coded in Matlab with the computationally 
intensive part written in C. The SARDONICS sampler is coded in Python also with its computationally 
intensive part coded in C. The Gibbs and block Gibbs samplers are coded in Python. The Swendsen-Wang 
and block Gibbs samplers take advantage of parallelism via parallel numerical linear algebra operations. 
The SARDONICS sampler and the Gibbs sampler, however, run on a single CPU. The adaptation time 
of SARDONICS is also included. 

connected graphs, we expect SARDONICS to have a competing edge over Swendsen-Wang, as already 
demonstrated to some extent in the experiments. 

6 Discussion 

The results indicate that the proposed sampler mixes very well in a broad range of models. However, 
as already pointed out, we believe that we need to consider alternatives to nonparametric methods 
in Bayesian optimization. Paranicitric methods would enable us to carry out infinite adaptation. This 
should solve the problems pointed out in the discussion of the results on the ferromagnetic 2D Ising model. 
Another important avenue of future work is to develop a multi-core implementation of SARDONICS as 
outlined in the previous section. 
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